ABSTRACT

When the Educational Testing Service became the
administrator of the National Assessment of Educational Progress
(NAEP) in 1983, it introduced scales based on item response theory
(IRT) as a way of presenting results of the assessment to the general 
public. Some properties of the scales and their uses are discussed. 
Initial attempts at presenting the assessment results reported the 
percent correct statistics for each individual item. IRT-based NAEP 
scales avoid the problems of average percent correct statistics and 
summarize complex information by reducing a large data set into a few 
manageable and interpretable summary statistics. The dimensionality 
of data sets has been studied to protect against losing important 
information in the summarization process. Research into scaling has 
demonstrated that, in most cases, the data support creating a single 
developmental scale. The NAEP also produces a global summary, using a 
weighted average of the domain scores. Scale anchoring is 
accomplished by describing what students at selected levels know and 
can do. A typical anchoring process is described. Three figures 
illustrate the use of NAEP scales. (SLD) 

1. INTRODUCTION

An educational assessment is fundamentally an information system 
designed to report to educational practitioners, policymakers, and the general 
public on the status of, and changes in, the performance of students. As an
information system, it must be concerned that the information produced and 
reported is useful to its audiences. Of course, the information presented 
must be accurate. But furthermore, a successful information system needs to 
report results in a concise and understandable way if it is to satisfy the 
various information needs which it attempts to serve. 



The National Assessment of Educational Progress (NAEP) was designed to 
report what students in American schools (both public and private) know and 
can do. Obviously, not all of the proficiencies of students can be assessed,
and so a sample of exercises in school subject areas is selected to represent 
the overall proficiencies of students. The selection of exercises is very 
important, and NAEP selects the exercises using a consensus approach. A large 
committee of learning area specialists, educators, and concerned citizens is 
formed for each subject area assessed. These learning area committees specify
the objectives for the assessment in terms of goals that students should 
achieve during the course of their education. The objectives for a subject 
area are typically defined in terms of a content-by-process area matrix for 
which the approximate percentage of items of each type is specified by the 
learning area committee. In order to satisfy the objectives of the assessment 
and ensure that the tasks selected to measure each goal cover a range of 
difficulty levels, NAEP samples a large number of exercises within each school 
subject area that it assesses. For example, recent assessments of reading, 
mathematics, science, history, and civics were each based on several hundred 
exercises.



Simply reporting all of the data that NAEP collects to all of its 
audiences would be clearly inefficient and unresponsive. It would be 
inefficient because many different users would be required to analyze and 
summarize the data. It would be unresponsive since different audiences have 
different needs, and so information must often be tailored for them. Some 
NAEP users, such as educational test developers and curriculum specialists, 
may want detailed information for specialized purposes that were unknown and
unpredictable at the time the assessment was designed, and so all of the data 
are made available on Public-Use Data Tapes, except for information that would 
compromise the anonymity of NAEP participants. Other information users, such 
as policy-makers and policy analysts, want more concise information that 
reduces the great volume of collected data into a form that is clear and
understandable and yet captures the main findings of the assessment.

There are many ways in which data may be summarized for general 
reporting. NAEP originally attempted to summarize the data by reporting the 
percentages of correct responses for individual items and later by average 
percentages over groups of items, but this method was found wanting for 
reasons to be given below. When the Educational Testing Service (ETS) became 
the NAEP administrator in 1983, it introduced scales based on item response 
theory (IRT) as a way of presenting results to the general public. These NAEP 
scales have now been widely accepted by many users of educational information. 

In this paper, some properties of the IRT scales and their uses will be 
discussed. First, we will address the problems of presenting the results by 
the percentage of students who answered individual items correctly and also by 
averages of percentages-correct for groups of items. We will then discuss how
IRT scaling captures the main features of the assessment data and how
the results are reported. Some of the issues in deciding whether to report
global scores or more detailed subscores will be discussed, as well as the
decision to make developmental scales that span different age levels. Finally, the use of
scale anchoring as a way of interpreting the meaning of the scale will be 
presented. Further details on the characteristics of the NAEP scales can be 
found in the NAEP technical reports (Beaton, 1987, 1988; Johnson and Zwick, 
1990). 



2. ALTERNATIVE METHODS OF REPORTING 

The initial attempts at presenting the assessment results reported 
percent correct statistics for each individual item. An (extreme) example of 
this type of reporting is given in Figure 1 which shows, for a single 
mathematics item, the percent of students, by subgroup of the population, 
choosing each possible response to the item. Obviously, Figure 1 provides a 
great deal of information (and, in fact, reports of this sort are routinely 
produced). However, this type of reporting quickly proved too cumbersome, and
hard for the interested public (policy makers, educational researchers, 
interested individuals) to interpret. 

For many of the constituents of NAEP, the level of detail provided by 
individual item level reporting is excessive, and may be counterproductive. 
Wirtz and Lapointe (1982) report about a rebellion of Washington D.C. parents 
when their children brought home mid-term report cards in the form of 4-page 
computer printouts indicating how the child was doing on each item in a long
list. Included in the report was an objectives check list telling, for 
example, whether or not the child could write a 2 digit number as a sum in 
which one addend is the next lower unit of ten or whether or not the child 
could identify initial consonant substitution in reading. Wirtz and Lapointe 
report that school offices were deluged with requests about how to interpret 
all this information. 

The problem with an item by item approach of reporting is that it
ignores overarching similarities in trends and group comparisons that are
common across items, which is exactly what an assessment is supposed to identify.
Bock, Mislevy and Woodson (1982) distinguish between the "fixed-item" approach
of reporting and the "random-item" concept. The fixed-item approach assumes
that each individual item is of primary interest in itself; the model for
this approach is survey research where each response (opinions on some issue 
for example) is of unique interest. Educational items generally do not share 
this unique importance. Rather, the items are viewed as random 
representatives of a conceptually infinite pool of items within the same 
domain and of the same type. In this random-item concept, a set of items is
taken to represent the domain of interest. 

Having moved away from primary emphasis on individual items to domains 
represented by sets of items, there is the need to have some measure of the 
achievement within each domain. An obvious index is the average percent
correct across all presented items within the domain of interest. As noted by
Bock, Mislevy and Woodson, the averaging tends to cancel out the effects of
peculiarities in item writing which can affect item difficulty in
unpredictable ways and produces the central tendency of the distribution of
correct responses across the presented items as the measure of the overall
achievement within the domain.

As noted by Wirtz and Lapointe, the process of aggregation for the 
reporting of NAEP results was generally applauded. Media coverage of 
assessment results dramatically increased when aggregate results were 
reported. This was at least in part because the aggregate reports provide the 
broad picture that the media and the public alike seem to consider more 
interesting and informative than individual item-by-item results. However, 
there are a number of significant problems with average percent correct 
scores.

First, the interpretation of average percent correct results depends on 
the selection of items; the selection of easy or difficult items could make 
student performance look good or bad - to a public accustomed to a passing 
score. Second, the average percent correct metric is obviously tied to the 
particular items going into the average. This means that age-to-age and year- 
to-year comparisons require the same exercises. The consequence of this is 
that measurement of trends in achievement based on an average percent correct 
metric is limited to either: 

1) considering trends on the small proportion of items common across 
years (a proportion which is continually diminishing because of the need 
to release items and the fact that some items become outdated) 

or 

2) pairwise comparisons from one year to the next based on the shared 
items. This can lead to different and incomparable percent-correct 
scales for different comparisons. As an example, trends in science 
achievement for the 1970, 1973 and 1977 assessments had to be expressed
in terms of the relative change from one year to the next. 

Finally, it is difficult to speak in terms of the distribution of
proficiencies in the population when the measure used is the average percent
correct. As noted in the report of the NAEP planning project under the
auspices of the Council of Chief State School Officers (1988), reporting an
average score for a population provides, by itself, very little information to
the public, policy-makers, or practitioners. Averages from two or more
assessments can indicate trends in central tendency, but provide no
information about how the performance of the students is distributed. What is
needed is information about trends in the distribution of student achievement,
that is, changes in the proportions of students with subject area abilities at
or above specified levels.

The IRT-based NAEP scales provide a meaningful description of 
achievement while avoiding the problems of average percent correct statistics.
The NAEP scales allow all students to be placed on a common scale even though 
none of the respondents take all of the items within the pool. This is 
important because it allows us to measure trend in terms of a consistent 
metric even though the item pool evolves over time. Furthermore, we can 
estimate the distribution of skills among students and provide meaningful 
interpretations about score levels in terms of predicted performance on 
specific exercises and in terms of educationally relevant tasks. 



3. THE NAEP SCALES

Before proceeding, it is important to note that, with real data, no 
summary is able to capture all of the information that is available in the
data. The most common statistic for summarizing data on a single variable, in 
education or other research areas, is the arithmetic average or mean, which 
captures only a small part of the information that is available in an entire 
data set. The mean is the single number that reproduces the original data 
most accurately in the least squares sense. Using two statistics, the mean 
and the standard deviation, gives more information about the available data, 
but not all, unless the distribution of data can be described by a known 
distribution such as the normal. Percentiles give even more information, but 
still do not completely describe the original data when the sample size is 
large. Summarizing, therefore, always loses some of the detailed information 
that is available in the full data set. 

Although scaling educational data inevitably loses some information 
about the performance of examinees, it is nonetheless an efficient way of 
summarizing the complex assessment data. In most sets of data resulting from
educational tests, there are regularities in the data. Response patterns tend
towards a triangular shape in a subject-by-items data matrix that is sorted
both by the number of correct responses and item difficulty (see Figure 2).
Some examinees answer a large number of items correctly, some answer only a
few. Some items are "easy" in that they are answered correctly by most
examinees, others are "hard" in that only a few persons answer them correctly.
And in most data sets, there is an interaction between the examinees and the
items, that is, the examinees who answer the difficult items correctly tend to
answer the easy ones correctly too, but answering the easy items correctly
does not imply answering the more difficult items. This is the regularity in
the data that IRT scaling attempts to encapsulate.
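
The triangular regularity can be illustrated with a small sketch. The response matrix below is invented for illustration and is not NAEP data; the function name sort_matrix is likewise hypothetical. Sorting the rows by number-right score and the columns by proportion correct exposes the pattern shown schematically in Figure 2.

```python
# Hypothetical 0/1 responses (rows = examinees, columns = items); not NAEP data.
responses = [
    [1, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 0, 0],
]

def sort_matrix(matrix):
    """Sort rows by number right (high first) and columns by proportion correct (easy first)."""
    n_items = len(matrix[0])
    # Proportion of examinees answering each item correctly.
    p_values = [sum(row[j] for row in matrix) / len(matrix) for j in range(n_items)]
    col_order = sorted(range(n_items), key=lambda j: -p_values[j])
    rows = sorted(matrix, key=lambda row: -sum(row))
    return [[row[j] for j in col_order] for row in rows]

sorted_responses = sort_matrix(responses)
for row in sorted_responses:
    print(" ".join(str(x) for x in row))
```

With real data the sorted matrix would show the triangle only approximately; the departures from it are the information that a single scale score does not carry.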

Scaling is a process by which the data are reduced to a few statistics 
that summarize a large data set. One or more statistics may be defined to 
represent the proficiency of examinees on one or more dimensions of 
proficiency. One or more statistics may be defined to represent the 
characteristics of the different items. An example of a simple scale is the 
number-right score which represents the proficiency of examinees and is 
particularly useful when all examinees have responded to the same set of 
items. The proportion of examinees passing an item is a statistic that
describes a property of an item. 

In most applications of item response theory, a single scale score 
represents an examinee's proficiency, and this score can be used to estimate 
an examinee's response to any item in the examination. The relationship 
between an examinee's response to an item and his or her scale score is 
approximated by a non-linear function. The non-linear function, which is 
different for each item, is characterized by an item statistic or statistics 
defined for each item. These item statistics are estimates of item parameters 
defined over a population of examinees. The item statistics may estimate one 
parameter, as in the Rasch model, or three parameters, as in the IRT model 
used by NAEP. The three parameter logistic model that is used in NAEP has 
been found to fit many data sets reasonably well. 
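
As a minimal sketch of the non-linear function, the standard form of the three parameter logistic model can be written as below. The function name, the parameter values, and the use of the conventional scaling constant D = 1.7 are assumptions made for illustration; they are not estimates from any NAEP item.

```python
import math

def p_correct_3pl(theta, a, b, c, D=1.7):
    """Probability of a correct response under the three parameter logistic model.

    theta: examinee proficiency; a: discrimination; b: difficulty;
    c: lower asymptote (guessing); D: conventional scaling constant (assumed).
    """
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

# Illustrative parameter values only.
for theta in (-2.0, 0.0, 2.0):
    print(theta, round(p_correct_3pl(theta, a=1.0, b=0.0, c=0.2), 3))
```

Setting c to zero and holding a constant across items gives the one-parameter case, in which only the difficulty b distinguishes one item from another.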

No real data set is perfectly ordered and thus any scaling procedure
that reduces the test results into a few summary statistics must lose some 
information. Examinees who are generally able to answer items may have 
occasional gaps in their knowledge. Especially when multiple choice items are 
used, poorly performing students may occasionally respond correctly to 
difficult items, even if by chance. As we shall see below, these departures 
from the expected regularity should be studied separately to help understand 
what the scale values encompass and what they do not. In most cases, the
scale values will encompass most of the general information about item 
responses that is in the data. 

This is the real function of IRT in an assessment: to summarize complex 
information by reducing a large data set into a few manageable and 
interpretable summary statistics. From the individual scores, we can 
estimate — and to some degree reproduce — the individual's actual responses to 
items. Individuals with high scores tend to get many items correct; 
individuals with low scores do not. If different items were randomly assigned 
to examinees, then we can also estimate how an examinee would have performed 
on items that he or she was not administered. The score is, therefore, a 
simple summary of how the individual did on the many items in the assessment.

IRT scaling is not the only way to summarize data. When writing was 
assessed in 1984, several attempts were made to adapt IRT technology to the 
nonbinary writing exercise responses using the models proposed by Masters 
(1982). However, it was found that these models did not provide acceptable 
results for the NAEP data and so an alternate method of scaling called the 
Average Response Method (ARM) was developed and used. The ARM
method is fully described in Beaton and Johnson (1990). 

The NAEP reporting scale

In addition to summarizing the data, NAEP must report its results. An
IRT scale is indeterminate; any linear transformation of the scale will 
reproduce the item responses equally well. By default, most computer programs 
assign the average scale score to be zero and the standard deviation to be one 
with the result that about half of the scale scores are usually negative.
Such an arbitrary reporting system would not be appropriate for public 
dissemination. 
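
The indeterminacy can be checked numerically. In the sketch below, which assumes the standard three parameter logistic form, a linear transformation of the proficiency scale is compensated by transforming the item difficulty in the same way and dividing the discrimination by the slope; the predicted probability is unchanged. The slope and intercept are hypothetical values chosen only to illustrate the point.

```python
import math

def p_correct_3pl(theta, a, b, c, D=1.7):
    """Standard three parameter logistic response function (D is a conventional constant)."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def rescale(theta, a, b, slope, intercept):
    """Apply theta* = slope * theta + intercept with the compensating item changes."""
    return slope * theta + intercept, a / slope, slope * b + intercept

theta, a, b, c = 0.5, 1.2, -0.3, 0.2    # illustrative values
slope, intercept = 50.0, 250.0           # hypothetical reporting transformation
theta2, a2, b2 = rescale(theta, a, b, slope, intercept)
p_before = p_correct_3pl(theta, a, b, c)
p_after = p_correct_3pl(theta2, a2, b2, c)
print(p_before, p_after)
```

Because every such transformation fits the item responses equally well, the choice of reporting metric can be made on grounds of communication rather than fit.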

NAEP reports its results as number-right scores on a hypothetical 500 
item test. Scores on this hypothetical test typically range from about 100
to 400. The test has certain idealized properties: the item difficulties of 
this hypothetical test are distributed evenly across the range of observed 
performance and somewhat beyond, all items have the same discriminatory power, 
and there is no guessing. If such a test could be built, it would fit the 
Rasch model precisely. 

Using this hypothetical test, NAEP results can be interpreted as test 
scores on an idealized test. There are no negative scores. Although the 
number of items in the test is arbitrary, it was chosen to minimize confusion 
with IQ scores, SAT scores, grade equivalents, and other well-known test 
metrics.



4. MULTIVARIABLE SCALING

As mentioned above, the objectives of national assessment instruments 
are developed by an extensive consensus process that reviews the curriculum 
objectives of a subject area. In some cases, the assessment is designed to 
assess different types of proficiencies which may be taught at different 
levels or in different subject-area courses. In the 1986 mathematics and 
science assessments, the objectives committee decided that different sub- 
domains in each area were important enough to be assessed and reported 
separately. In mathematics, the sub-domains were (1) Knowledge and skills in
numbers and operations, (2) Higher level applications of numbers and 
operations, (3) Measurement, (4) Geometry, and (5) Algebra. In science, the 
sub-domains were (1) Life Sciences, (2) Physics, (3) Chemistry, (4) Earth and 
Space Sciences, and (5) the Nature of Science. The item response theory
approach to scaling was adapted to fulfill the requirement of separate 
reporting by sub-domain. 

In the cases of mathematics and science, student exercises were 
developed to probe each sub-domain. The assessment items, therefore, could be 
classified by sub-domain. However, the amount of individual student time for 
NAEP is kept to about an hour and thus accurate measurement of students in 
each sub-domain was impossible. Simply applying existing IRT technology would 
have resulted in poor estimates of student performance in the several sub-
domains. The concepts of plausible values and conditioning (see, e.g.,
Mislevy, 1990; Johnson, 1989) were applied to the NAEP mathematics and science
assessments to generate, under certain reasonable assumptions, consistent
estimators of attributes of the distributions of student performance in the
sub-domains.



5. DIMENSIONALITY 

As previously mentioned, summarizations necessarily lose some of the 
information that is available in the original data. The IRT scaling used in 
NAEP is intended to summarize the previously defined subject areas or sub- 
domains of those subject areas. In so doing, the IRT process assumes that 
there is a regularity in the data in the subject area or domain that is
scaled. But what if there is not? Is there a better way to summarize the
data, perhaps using several different scales? Can the data be summarized at 
all? 

NAEP has addressed the question of what to summarize and report through 
the study of the dimensionality of the data within a subject area or domain. 
After the 1984 assessment of reading, the dimensionality of the reading data 
was extensively studied, and it was concluded that reporting several reading 
sub-scales would not add important information to reporting one reading scale. 
Various studies of the several scales in mathematics and science have shown 
that the different subscales do provide additional information that a single 
scale would not — for example, gender differences in mathematics and science 
performance — and no reason has been found to subdivide further the domains 
that have been reported. Studying the dimensionality of data sets protects 
against losing important information in the summarization process. 



6. DEVELOPMENTAL SCALES 

When summarizing, a natural question arises as to what populations 
should be summarized over. NAEP has traditionally assessed 9-, 13-, and 17- 
year olds and has recently added overlapping samples of fourth, eighth, and 
twelfth grade students. Should the scales be separately defined at each age 
or grade level or can one scale span all three age and grade levels? An 
advantage of separate scales is that the information may be targeted to 
specific age levels. An advantage of a single scale is that performance and 
changes in performance of each grade level can be viewed in the context of 
student growth between age and grade levels.

NAEP has approached this question empirically and found that in most 
cases the data support using a single, developmental scale. In designing the 
NAEP instruments, NAEP has included common items at adjacent age levels, that 
is, the fourth and eighth grade assessments include some common items as do
the eighth and the twelfth grade assessments. These items allow us to compare 
the performance of the older students with that of the younger students. From
the perspective of data summarization, the question is whether or not a single
item response function fits adjacent age levels equally well; that is, does
assigning a common scale score to students at adjacent age levels seriously
affect the accuracy of the prediction of their responses on common items?

Before a developmental scale is produced, the empirical estimates of 
item characteristic functions are obtained for each item for each age level 
that was presented the item. These empirical item characteristic functions 
are calculated without assuming any functional form thereby allowing checks of 
the assumptions of item parameter independence and unidimensionality. In a
small number of cases (less than 4%) the empirical item characteristic
functions differ between ages. In such cases, the item is treated as a
separate item at each age. 
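
One simple way to obtain an empirical item characteristic function without assuming a functional form is to compute the proportion correct within bins of scale score, separately for each age level. The sketch below illustrates that idea only; the binning scheme, the function name, and the records are assumptions made for illustration and do not reproduce the procedure actually used in NAEP.

```python
from collections import defaultdict

def empirical_icf(records, bin_width=50):
    """Proportion correct on one item within scale-score bins, separately by age.

    records: iterable of (age, scale_score, correct) tuples, correct in {0, 1}.
    Returns {age: {bin_lower_bound: proportion_correct}}.
    """
    counts = defaultdict(lambda: [0, 0])  # (age, bin) -> [n_correct, n_total]
    for age, score, correct in records:
        bin_lo = int(score // bin_width) * bin_width
        counts[(age, bin_lo)][0] += correct
        counts[(age, bin_lo)][1] += 1
    result = defaultdict(dict)
    for (age, bin_lo), (n_correct, n_total) in counts.items():
        result[age][bin_lo] = n_correct / n_total
    return {age: dict(bins) for age, bins in result.items()}

# Invented records for a single common item; not NAEP data.
records = [
    (9, 210, 0), (9, 230, 1), (9, 260, 1), (9, 270, 1),
    (13, 220, 1), (13, 240, 0), (13, 255, 1), (13, 290, 1),
]
curves = empirical_icf(records)
for age in sorted(curves):
    print(age, sorted(curves[age].items()))
```

If the curves for two ages agree within sampling error across the bins they share, a common item response function is plausible for that item; a marked difference would suggest treating the item as a separate item at each age.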

The research into scaling has demonstrated that, in most cases, the data
do support creating developmental scales. Using this feature, we are able to 
display trends in student proficiency in reference to the differences between 
age levels. It has also been possible to compare the proficiencies of 
students at one age level to those of another. For most items (excluding 
those mentioned above), we can say that students with a given scale score have
approximately the same probability of answering the common items correctly
regardless of their ages or grades. If the common items can be assumed to be
from the same item population as the other items, then we can estimate how 
students at one age level would perform if given items from another. 



7. GLOBAL SCORES 

As mentioned above, different levels of summarization are appropriate 
for different audiences. For very detailed analyses, the basic data are made 
available so that any data analyst may summarize the data in any way he or she
thinks appropriate. NAEP has chosen to report mainly at the level of subject
area and domain, and uses dimensionality analyses to investigate the
information that is not included in the summary. However, when reporting is
done by domains within a subject area, many policy-makers want a more global
summary, and so NAEP produces an overall summary also.

To compute an overall summary, NAEP uses a weighted average of the 
domain scores. When the domains are defined, the learning area committee of 
subject matter specialists, educators, and others not only specifies the 
assessment objectives but also gives each objective a weight signifying its 
importance. These weights are used to determine the number of items in the
assessment that will be used to measure each objective. These weights are 
also used in computing the overall summary of performance. 
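
The computation of the overall summary can be sketched as a weighted average. The domain scores and weights below are hypothetical values chosen for illustration, not figures from any assessment, and the function name is likewise an assumption.

```python
def composite_score(domain_scores, weights):
    """Weighted average of domain scores; the weights need not sum to one."""
    total_weight = sum(weights[d] for d in domain_scores)
    return sum(domain_scores[d] * weights[d] for d in domain_scores) / total_weight

# Hypothetical domain means and committee weights, for illustration only.
domain_scores = {"Measurement": 300.0, "Geometry": 280.0, "Algebra": 290.0}
weights = {"Measurement": 0.2, "Geometry": 0.3, "Algebra": 0.5}
print(composite_score(domain_scores, weights))
```

Dividing by the total weight keeps the composite on the same metric as the domain scores even if the committee's weights are expressed as percentages or item counts.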

It is clear that the overall summary does not contain all of the 
information in the more detailed domain scores — if it did, there would be no 
point to estimating performance in the separate domains. The domains 
themselves do not contain all of the information that more detailed summaries 
might have, and so on. And yet, from some perspectives the overall summary is
adequate and useful for some information purposes. 



8. ANCHORING

Scale anchoring is a way of attaching meaning to a scale.
Traditionally, meaning has been attached to educational scales by norm- 
referencing, that is, by comparing students at a particular scale level to 
other students. In contrast, the NAEP scale anchoring is accomplished by 
describing what students at selected levels know and can do. This is the 
primary purpose of NAEP. 

The anchoring process is straightforward. There are several ways to
anchor a scale, and the conceptually simplest will be described here. Several 
scale levels are selected — they should be far apart to be noticeably different 
but not so far apart as to be trivial. The students are then sorted by scale 
score, and students at or near each level are grouped together. For the group 
at the lowest scale score level, what they know and can do is defined by
the items that a vast majority of the students answered correctly. At the 
higher score level, the question is: what is it that students at this level 
know and can do that students at the next lower level cannot. The answer is 
defined by the items that a vast majority of students at this level answered 
correctly but a majority at the next lower level answered incorrectly. The
assessment items are, therefore, grouped by the levels between which they 
discriminate. Many items do not discriminate between any pair of scale 
points.

Figure 3 shows a graphical representation of the statistical anchoring 
process. Three items are displayed, identified by the labels "A", "B", and 
"C". Six anchoring levels are identified, corresponding to scale values of 
100, 150, 200, 250, 300 and 350. An item will anchor at a given level if: (1) 
65 to 80 percent of students attaining that level can answer the item, (2) the
probability of success on the item for students at the next lower level is
less than 50 percent, and (3) the difference in the probabilities of success
between the two levels is at least 30 percentage points. In Figure 3, Item "A"
anchors at the 250 level since the probability of correct response for
students with proficiencies around 250 is 80% while the probability of success
for students at the next lower level (200) is 40%. Item "B" anchors at the
300 level since there is a steep rise in the probability of success between
250 and 300 and since the probabilities of success at the two levels satisfy
the threshold values. Item "C" does not anchor at any level because the
discrimination between adjacent levels is not sufficiently sharp. 
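
The three statistical criteria can be expressed as a short sketch. Criterion (1) is read here as a minimum of 65 percent correct at the level, which is an assumption about the stated 65 to 80 percent range. Only the 40% and 80% values for Item "A" come from the text; all other probabilities are invented to mimic the shapes in Figure 3. The lowest level has no lower neighbor and is therefore not evaluated by this function.

```python
def anchor_level(p_by_level, min_p_at_level=0.65, max_p_below=0.50, min_gain=0.30):
    """Return the lowest level at which an item anchors, or None.

    p_by_level: list of (scale_level, probability_correct), ascending by level.
    The default thresholds are an assumed reading of the three stated criteria.
    """
    for (_, p_low), (level, p_high) in zip(p_by_level, p_by_level[1:]):
        if (p_high >= min_p_at_level          # criterion (1)
                and p_low < max_p_below       # criterion (2)
                and p_high - p_low >= min_gain):  # criterion (3)
            return level
    return None

levels = [100, 150, 200, 250, 300, 350]
# Probabilities are illustrative except for Item A at 200 (0.40) and 250 (0.80).
item_a = list(zip(levels, [0.05, 0.15, 0.40, 0.80, 0.95, 0.99]))
item_b = list(zip(levels, [0.02, 0.05, 0.12, 0.30, 0.75, 0.92]))
item_c = list(zip(levels, [0.35, 0.45, 0.55, 0.65, 0.75, 0.85]))

for name, item in (("A", item_a), ("B", item_b), ("C", item_c)):
    print(name, anchor_level(item))
```

Item "C" fails because its probability of success rises gradually: by the time it reaches 65 percent, the probability at the next lower level is already above 50 percent.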

A committee of subject matter experts, educators, and others is then 
assembled to review the items and, using their knowledge of the subject matter 
and student performance, try to generalize from the items to more general 
constructs. Several sample items are selected to illustrate the construct. 
The constructs are then described verbally and sent out for general review by 
professionals in education. 

SUMMARY



The purpose of NAEP is to assess and report what students in American 
schools know and can do. Over the years, NAEP has pursued this purpose in a 
number of ways. The introduction of item response theory into NAEP, and
several innovations in IRT, have resulted in scales that summarize the main 
findings in the vast set of student performance demonstrations that NAEP 
collects. These scales are not only useful to the educational research 
community but are also useful in communicating information about the status of 
education in America to the general public. NAEP not only summarizes the data 
but also makes all of its data available to the research community for 
detailed analyses or alternate summarizations.

Figure 1: Example of a complete item-level report
(Source: Mathematics Report 04-MA-20, 1972-73 Assessment, National Assessment of Educational Progress) 

Figure 2: Response Patterns for an Ideal Test

Subject-by-item data matrix sorted by
number of correct responses and by
item difficulty

Figure 3: Three example items for Scale Anchoring
Item "A" Anchors at 250 
Item "B" Anchors at 300 
Item "C" Does Not Anchor 